Telecommunications services customer churn

Telecom image
Telecom image

ASK: Define the business problem, project objective, invloved stakeholders,etc.

  • Business Task: Reduce the churn rate by 5% to 10% by offering proactive incentives to “at-risk” customers.
  • Objective: Build a predictive model to identify customers who are likely to cancel their subscription service.
  • Key Stakeholders: Retention Marketing Team and Customer Support.
  • Question: What are the top three predictors of a customer leaving, and can we use these predictors to predict churn with at least 75% sensitivity?

Prepare: Data source, structure and integrity

This data was made publicly available by IBM, and downloaded from Kaggle.
- Data Integrity: This data does not include Personal Identifiable Information like Phone numbers, Identity numbers or credit card numbers, so users privacy is protected.
- Data location: Click this link to find the dataset
- Data Organization: the data is organised as a .csv file named “WA_Fn-UseC_-Telco-Customer-Churn.csv”.
- Tool Selection: I choose to use r, for its robust statistical packages and clean data manipulation syntax.


Process: Load, Clean and Transform the dataset

Setting up the environment

library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library ("caret")
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library("randomForest")
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
telco_customer_churn <- read_csv("datasets/telco_customer_churn.csv")
## Rows: 7043 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): customerID, gender, Partner, Dependents, PhoneService, MultipleLin...
## dbl  (4): SeniorCitizen, tenure, MonthlyCharges, TotalCharges
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

View and Rename columns

print("Here is a list of the column names:")
## [1] "Here is a list of the column names:"
colnames(telco_customer_churn)
##  [1] "customerID"       "gender"           "SeniorCitizen"    "Partner"         
##  [5] "Dependents"       "tenure"           "PhoneService"     "MultipleLines"   
##  [9] "InternetService"  "OnlineSecurity"   "OnlineBackup"     "DeviceProtection"
## [13] "TechSupport"      "StreamingTV"      "StreamingMovies"  "Contract"        
## [17] "PaperlessBilling" "PaymentMethod"    "MonthlyCharges"   "TotalCharges"    
## [21] "Churn"
telco_customer_churn <- telco_customer_churn %>% 
  rename( customer_id = customerID,
          senior_citizen = SeniorCitizen,
          partner = Partner,
          dependents = Dependents,
          phone_service = PhoneService,
          multiple_lines = MultipleLines,
          internet_service = InternetService,
          online_security = OnlineSecurity,
          online_backup = OnlineBackup,
          device_protection = DeviceProtection,
          tech_support = TechSupport,
          streaming_tv = StreamingTV,
          streaming_movies = StreamingMovies,
          contract = Contract,
          paperless_billing = PaperlessBilling,
          payment_method = PaymentMethod,
          monthly_charges = MonthlyCharges,
          total_charges = TotalCharges,
          churn = Churn)
print("Nineteen columns have been renamed here are the current column names:")
## [1] "Nineteen columns have been renamed here are the current column names:"
colnames(telco_customer_churn)
##  [1] "customer_id"       "gender"            "senior_citizen"   
##  [4] "partner"           "dependents"        "tenure"           
##  [7] "phone_service"     "multiple_lines"    "internet_service" 
## [10] "online_security"   "online_backup"     "device_protection"
## [13] "tech_support"      "streaming_tv"      "streaming_movies" 
## [16] "contract"          "paperless_billing" "payment_method"   
## [19] "monthly_charges"   "total_charges"     "churn"

Check for missing (NA) values

num_na <- telco_customer_churn  %>%
  summarise(across(everything(), ~ sum(is.na(.))))
num_na
## # A tibble: 1 × 21
##   customer_id gender senior_citizen partner dependents tenure phone_service
##         <int>  <int>          <int>   <int>      <int>  <int>         <int>
## 1           0      0              0       0          0      0             0
## # ℹ 14 more variables: multiple_lines <int>, internet_service <int>,
## #   online_security <int>, online_backup <int>, device_protection <int>,
## #   tech_support <int>, streaming_tv <int>, streaming_movies <int>,
## #   contract <int>, paperless_billing <int>, payment_method <int>,
## #   monthly_charges <int>, total_charges <int>, churn <int>

Convert total_charges to numeric (this creates the NAs from the blank spaces)

class(telco_customer_churn$total_charges)
## [1] "numeric"
telco_customer_churn <- telco_customer_churn %>%
  mutate(total_charges = as.numeric(as.character(total_charges)))
class(telco_customer_churn$total_charges)
## [1] "numeric"

Convert the churn column to a Factor

class(telco_customer_churn$churn)
## [1] "character"
telco_customer_churn$churn <- as.factor(telco_customer_churn$churn)
class(telco_customer_churn$churn)
## [1] "factor"

Remove the eleven empty rows

telco_customer_churn <- telco_customer_churn %>% 
  drop_na(total_charges)
num_na <- telco_customer_churn  %>%
  summarise(across(everything(), ~ sum(is.na(.))))
num_na
## # A tibble: 1 × 21
##   customer_id gender senior_citizen partner dependents tenure phone_service
##         <int>  <int>          <int>   <int>      <int>  <int>         <int>
## 1           0      0              0       0          0      0             0
## # ℹ 14 more variables: multiple_lines <int>, internet_service <int>,
## #   online_security <int>, online_backup <int>, device_protection <int>,
## #   tech_support <int>, streaming_tv <int>, streaming_movies <int>,
## #   contract <int>, paperless_billing <int>, payment_method <int>,
## #   monthly_charges <int>, total_charges <int>, churn <int>

Check for duplicates

num_duplicates <- sum(duplicated(telco_customer_churn))
num_duplicates
## [1] 0

Remove the customer_id column since it is not necessary

telco_customer_churn <- telco_customer_churn %>% 
  select(-customer_id)
head(telco_customer_churn)
## # A tibble: 6 × 20
##   gender senior_citizen partner dependents tenure phone_service multiple_lines  
##   <chr>           <dbl> <chr>   <chr>       <dbl> <chr>         <chr>           
## 1 Female              0 Yes     No              1 No            No phone service
## 2 Male                0 No      No             34 Yes           No              
## 3 Male                0 No      No              2 Yes           No              
## 4 Male                0 No      No             45 No            No phone service
## 5 Female              0 No      No              2 Yes           No              
## 6 Female              0 No      No              8 Yes           Yes             
## # ℹ 13 more variables: internet_service <chr>, online_security <chr>,
## #   online_backup <chr>, device_protection <chr>, tech_support <chr>,
## #   streaming_tv <chr>, streaming_movies <chr>, contract <chr>,
## #   paperless_billing <chr>, payment_method <chr>, monthly_charges <dbl>,
## #   total_charges <dbl>, churn <fct>

Save the clean dataset

write_csv(telco_customer_churn,"C:\\Users\\itumeleng\\Documents\\R_programs\\datasets\\processed\\clean_telco_customer_churn.csv")

Analyse: Explore, aggregate and model the data

Get a statistical summary of the entire tibble

summary(telco_customer_churn) 
##     gender          senior_citizen     partner           dependents       
##  Length:7032        Min.   :0.0000   Length:7032        Length:7032       
##  Class :character   1st Qu.:0.0000   Class :character   Class :character  
##  Mode  :character   Median :0.0000   Mode  :character   Mode  :character  
##                     Mean   :0.1624                                        
##                     3rd Qu.:0.0000                                        
##                     Max.   :1.0000                                        
##      tenure      phone_service      multiple_lines     internet_service  
##  Min.   : 1.00   Length:7032        Length:7032        Length:7032       
##  1st Qu.: 9.00   Class :character   Class :character   Class :character  
##  Median :29.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :32.42                                                           
##  3rd Qu.:55.00                                                           
##  Max.   :72.00                                                           
##  online_security    online_backup      device_protection  tech_support      
##  Length:7032        Length:7032        Length:7032        Length:7032       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  streaming_tv       streaming_movies     contract         paperless_billing 
##  Length:7032        Length:7032        Length:7032        Length:7032       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  payment_method     monthly_charges  total_charges    churn     
##  Length:7032        Min.   : 18.25   Min.   :  18.8   No :5163  
##  Class :character   1st Qu.: 35.59   1st Qu.: 401.4   Yes:1869  
##  Mode  :character   Median : 70.35   Median :1397.5             
##                     Mean   : 64.80   Mean   :2283.3             
##                     3rd Qu.: 89.86   3rd Qu.:3794.7             
##                     Max.   :118.75   Max.   :8684.8

Get total number of customers

num_customers <- telco_customer_churn %>% 
  summarise(total = n()) %>% 
  pull(total)
num_customers
## [1] 7032

Group and compare customers using churn coluimn

churn_summary <- telco_customer_churn %>%
  group_by(churn) %>%
  summarise(average_tenure = mean(tenure),
    monthly_avgerage = mean(monthly_charges),
    count = n())
churn_summary <- churn_summary %>% 
  mutate(percentage = round(count / num_customers * 100,2))
churn_summary
## # A tibble: 2 × 5
##   churn average_tenure monthly_avgerage count percentage
##   <fct>          <dbl>            <dbl> <int>      <dbl>
## 1 No              37.7             61.3  5163       73.4
## 2 Yes             18.0             74.4  1869       26.6

Split Data for model training

set.seed(42)
train_index <- createDataPartition(telco_customer_churn$churn, p = 0.8, list = FALSE)
train_data <- telco_customer_churn[train_index, ]
test_data  <- telco_customer_churn[-train_index, ]

Train the model

rf_model <- randomForest(churn ~ ., data = train_data, ntree = 100, importance = TRUE)

Use the model to make predictions

predictions <- predict(rf_model, test_data)

Get the probability of predictions

prob_predictions <- predict(rf_model, test_data, type = "prob")

Set a threshold of 30% to and make new more accurate predictions

custom_threshold <- 0.30
new_predictions <- ifelse(prob_predictions[, "Yes"] > custom_threshold, "Yes", "No")

Convert back to factors so confusionMatrix works

new_predictions <- factor(new_predictions, levels = c("No", "Yes"))

Evaluate the predictions made against actual data

conf_matrix <- confusionMatrix(predictions, test_data$churn, positive = "Yes")

Evaluate the new predictions made against actual data

new_conf_matrix <- confusionMatrix(new_predictions, test_data$churn, positive = "Yes")

Share: Communicate insights through visuals

Pie chart to visualize churn rate

ggplot(churn_summary, aes(x = "", y = percentage, fill = churn)) + 
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y",start = 0) +
  scale_fill_manual(values = c("No" = "green", "Yes" = "red")) +
  theme_void() +
  labs(title = "Overall Customer Churn Rate") +
  geom_text(aes(label = paste0(percentage, "%")), 
            position = position_stack(vjust = 0.5), 
            color = "white", size = 5)

Density plot of Tenure

ggplot(telco_customer_churn, aes(x = tenure, fill = churn)) +
  geom_density(alpha = 0.5) +
  labs(title = "When do Customers Leave? (Tenure Distribution)",
       x = "Months with Company",
       y = "Density") +
  theme_minimal() +
  scale_fill_manual(values = c("No" = "green", "Yes" = "red"))

Initial model confusion matrix

conf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  936 205
##        Yes  96 168
##                                          
##                Accuracy : 0.7858         
##                  95% CI : (0.7634, 0.807)
##     No Information Rate : 0.7345         
##     P-Value [Acc > NIR] : 5.097e-06      
##                                          
##                   Kappa : 0.3942         
##                                          
##  Mcnemar's Test P-Value : 4.815e-10      
##                                          
##             Sensitivity : 0.4504         
##             Specificity : 0.9070         
##          Pos Pred Value : 0.6364         
##          Neg Pred Value : 0.8203         
##              Prevalence : 0.2655         
##          Detection Rate : 0.1196         
##    Detection Prevalence : 0.1879         
##       Balanced Accuracy : 0.6787         
##                                          
##        'Positive' Class : Yes            
## 

Initial Model Summary (Standard 50% Threshold)

The initial Random Forest model, utilizing a default 50% probability threshold, achieved an overall accuracy of 78.65%. While this model was highly effective at identifying loyal customers who were likely to stay 90.89% specificity), it performed poorly regarding the primary business task of detecting churn. With a sensitivity of only 44.77%, the model failed to identify more than half of the actual churners in the test set—missing 206 at-risk customers. This “conservative” approach resulted in fewer false alarms, but it left the business vulnerable to losing a significant portion of its customer base without prior warning.

Optimized model confusion matrix

new_conf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  816 128
##        Yes 216 245
##                                           
##                Accuracy : 0.7552          
##                  95% CI : (0.7318, 0.7774)
##     No Information Rate : 0.7345          
##     P-Value [Acc > NIR] : 0.04168         
##                                           
##                   Kappa : 0.4162          
##                                           
##  Mcnemar's Test P-Value : 2.722e-06       
##                                           
##             Sensitivity : 0.6568          
##             Specificity : 0.7907          
##          Pos Pred Value : 0.5315          
##          Neg Pred Value : 0.8644          
##              Prevalence : 0.2655          
##          Detection Rate : 0.1744          
##    Detection Prevalence : 0.3281          
##       Balanced Accuracy : 0.7238          
##                                           
##        'Positive' Class : Yes             
## 

Optimized Model Summary (Adjusted Threshold)

To better align the analysis with the business goal of proactive retention, we adjusted the classification threshold from 0.5 down to 0.3. This adjustment instructed the model to flag customers as “high-risk” even if they showed only a 30% probability of leaving. This modification successfully increased the sensitivity to 65.68%, catching 245 churners—an improvement of 78 customers over the initial model. While this shift led to a slight decrease in overall accuracy (75.52%) and an increase in false positives (loyal customers flagged as at-risk), it significantly reduced the number of missed churners from 206 to 128. This version of the model is much more valuable for the marketing team, as it provides a broader and more accurate “hit list” for retention campaigns.


Act: Provide Recommendations

  • Option 1: Create a Campaign to move high-risk Month-to-Month users onto a 1-year discounted plan.
  • Option 2: Improve the onboarding process for “Fiber Optic” customers, as they show high churn in the first 3 months.
  • Option 3: Target “Senior Citizens” with specific technical support packages, as they have a higher-than-average churn rate.